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Tlie ciiirally improved (CI) quark propagator in Landau gauge is calculated in two flavor lattice 
Quantum Chromodynamics. Its wave-function renormalization function Z{p'^) and mass function 
M{p^) are studied. To minimize lattice artifacts, tree-level improvement of the propagator and tree- 
level correction of the lattice dressing functions is applied. Subsequently the CI quark propagator 
under Dirac operator low-mode removal is investigated. The dynamically generated mass in the 
infrared domain of the mass function is found to dissolve continuously as a function of the reduction 
level and strong suppression of Z{p'^) for small momenta is observed. 
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I. INTRODUCTION 

The quark propagator is one of the fundamental ob- 
jects in Quantum Chromodynamics (QCD). The mass 
function of the quark propagator reveals the value of the 
running quark mass in the deep ultraviolet (UV) where 
interactions are weak due to the asymptotic freedom of 
QCD. In the infrared (IR), the dynamical generation of 
mass which is associated with the spontaneous breaking 
of chiral symmetry is exhibited by the mass function. 
The IR is not accessible with perturbative methods; lat- 
tice QCD provides a nonperturbative ab initio approach 
to QCD and thus is a well adapted tool to study the IR 
physics of the strong nuclear force. 

The quark propagator is a gauge dependent object and 
thus the gauge has to be fixed in order to study its proper- 
ties; we adopt the manifestly Lorentz covariant Landau 
gauge for the present work. The Landau gauge quark 
propagator has been studied on the lattice with vari- 
ous fermionic actions. Some initial investigations using 
(improved) Wilson fermions have been reported in Ref. 
1 , [4I . A series of studies using standard Kogut-Susskind 
3 1 and Asqtad [1] quarks found that staggered quarks are 
well suited to explore the properties of the quark propa- 
gator on the lattice 

Lattice Dirac operators that fulfill the Ginsparg- 
Wilson (GW) equation allow for lattice fermions that 
have an exact chiral symmetr y a. t nonzero lattice spac- 
ing. The overlap operator pllIT^ provides a solution to 
the GW equation. The quark propagator from the over- 
lap action has been examined in 113-19]. The drawback 
of overlap fermions is their very high computational cost 
which renders them impractical for full dynamical simu- 
lations. 

In this letter we analyze the quark propagator from the 
so-called chirally improved (CI) Dirac operator [2l| 
which fulfills the GW equation not exactly, but only ap- 
proximately. Nevertheless, the gain in simulation time of 
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roughly one order of magnitude, in comparison to over- 
lap fermions, allows for an investi gat ion of the propagator 
on full dynamical configurations p2l . [23j . The better chi- 
ral properties of the CI operator as opposed to Wilson's 
fermion action make it well suited to explore effects of 
spontaneous chiral symmetry breaking on the lattice. 

Banks and Casher formulated a relation of the density 
of the smallest nonzero eigenvalues of the Dirac operator 
to the chiral condensate [2^ . In [2^ we have studied the 
effects of removing the lowest eigenmodes of the Hermi- 
tian CI Dirac operator 75£'ci on the meson spectrum 
and found signals for the restoration of chiral symmetry 
(the masses of the p and ai became approximately de- 
generate, cf. |26|) w hereas confining properties persisted. 
The authors of [27| expand the Wilson loop in terms of 
Dirac operator eigenmodes and detect that removing the 
lowest modes does not infiuence the static quark poten- 
tial qualitatively. 

A portion of this study aims at answering the ques- 
tion, how change the quark wave-function renormaliza- 
tion function Z{p^) and the quark mass function M{p'^) 
under Dirac low-mode removal? It is expected that the 
mass function flattens out in the IR once chiral symmetry 
is restored. Yet another question of interest is how the 
Dirac eigenmode truncation level at which chiral sym- 
metry was found to be approximately restored (2^ , com- 
pares to the loss of dynamical mass generation in M{p'^) 
as a function of the truncation level. 

The remainder of this work is as follows: in Sec. HIl we 
briefly summarize the defining equations of lattice Lan- 
dau gauge fixing. In Sec. IIIII we first remind the reader 
of the main steps in the construction of the Dci oper- 
ator, followed by a discussion of Zip^) and M{p^) from 
the Dci at tree-level and in the full interacting case. In 
order to reduce the dominant lattice artifacts we apply 
tree-level improvement and test a multiplicative and an 
hybrid scheme of tree- level correction. In Sec. IIVI we in- 
vestigate Z{p^) and M{p^) from the Dci under Dirac 
low-mode removal and in Sec.|V]we summarize and con- 
clude. 
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II. GAUGE FIXING 



A. The CI Dirac operator 



The continuum Landau gauge condition, 



(1) 



can be realized on the lattice by requiring the maximiza- 
tion of the gauge functional 



(2) 



fl.X 



with respect to gauge transformations g{x) £ SU(3) 
where 



U^^ix)^g{x)U,{x)g{x + fi)^. 



(3) 



The sum in Eq. ^ runs over the four Dirac components /i 
and all lattice sites x. Once such a gauge transformation 
is found, the discrete Landau gauge condition 



A(x) =^(A^(a;)-A^(x-A)) =0 



(4) 



holds, where ^^t(a;) is recovered from the lattice gauge 
links U^{x) via 



U^{x) ~ U^{x)^ 



2iago 



(5) 



tracclcss 



A measure for the achieved Landau gauge "quality" is 
given by 



(6) 



The so-called chirally improved Dirac operator Dqi 
was introduced in (20| and first analyzed in [21] where 
also its spectral properties were studied. An initial 
quenched hadron spectroscopy using the Dci was ex- 
amined in [29^ before dynamical configurations including 
two light degenerate CI quarks have been generated in 
order to calculate the hadron spectrum in a series of pa- 
pers [12, m, 113, HH . Renormalization factors of quark 
bilinears of the Dci were studied in [H, HI] . 

The CI Dirac operator is an approximate solution to 
the GW equation. It is constructed by expanding the 
most general Dirac operator in a basis of simple opera- 
tors. 



Dci [x, y) 
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'^(i7)r,+mol. 



(7) 



where the sum runs over all elements of the Clifford 
algebra. The coefficients ciy(/7) consist of path ordered 
products of the link variables U connecting lattice sites 
X and y. Inserting this expansion into the GW equation 
then turns into a system of coupled quadratic equations 
for the expansion coefficients of the Dci • That expansion 
provides for a natural cutoff which turns the quadratic 
equations into a simple finite system. 

The ansatz is constructed such that all symmetries of 
the fermionic action are maintained and moreover 75- 
hermiticity is imposed. The so-called clover term 3J] is 
included for 0{a) improvement where the Csw parameter 
is set to its tree- level value (one) . 



B. Configurations 



here the trace goes over the color indices, Nc is the num- 
ber of colors and V is the number of lattice points. In 
the later discussion of the CI quark propagator we will 
choose 6 < 10""'^° as the stopping criterion for the gauge 
fixing algorithm. 

We accelerate the costly task of lattice gauge fixing by 
utilization of the graphics processing unit (GPU) with 
NVIDIA® 's CUDA^" (Compute Unified Device Archi- 
tecture) programming environment, as pointed out in the 
Appendix El 

For a general discussion of lattice gauge fixing and its 
problems we refer to [2^. 



For the analysis of the CI quark propagator we use 125 
gauge field configurations 22, 23] of lattice size 16'^ x 32 
and lattice spacing a = 0.144(1) fm. The configurations 
include two light degenerate dynamical CI quark flavors 
with the mass parameter set to mo = —0.077 and a re- 
sulting bare AWI-mass of m = 15.3(3) MeV. For the 
simulation of the gauge fields as well as for our valence 
quarks, paths up to length four are used in the ansatz 
Eq. ([7]) and the corresponding coefficients can be found 
in [22]. 



C. Nonperturbative quark propagator 



III. THE CI QUARK PROPAGATOR 

In the present section we analyze the lattice dressing 
functions from the CI quark propagator after having re- 
peated the main steps in the construction of the CI Dirac 
operator. 



The continuum quark propagator at tree-level reads 

S^'^^p) = (i^ + m)-' (8) 

where m is the bare quark mass. In a manifestly covariant 
gauge like Landau gauge, the interacting renormalized 
quark propagator S{^;p) can be decomposed into Dirac 
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scalar and vector parts 

S{fi;p) = {iMif^-.p') + B{fi;p')y' (9) 
or equivalently as 

S{^l■,p) = Z{fi;p') {%i + M(j?)y^ . (10) 

In the last equation we introduced the wave-function 
renormahzation function Z(^]p^) = l/A(/i;p^) and the 
mass function M{p'^) = B{^;p'^) / A{pL]p'^). 

On the lattice, the regularized quark propagator is cal- 
culated and consequently it depends on the cutoff a. The 
regularized quark propagator Sl{p] a) can then be renor- 
malized at the renormalization point /i with the momen- 
tum independent quark wave-function renormalization 
constant Z2(/i;a), 

SL{p;a) ^ Z2{^i;a)S{fi;p). (11) 

Whereas the mass function M(p^) is independent of 
the renormalization point /x (and equivalently of the cut- 
off scale a), the wave-function renormalization function 
Z{^;p^) differs at different scales but can be related from 
different scales by multiplication with a constant, i.e., by 
the ratio of the two different quark renormalization con- 
stants. 

The momentum subtraction scheme (MOM) has the 
renormalization point boundary conditions Z(/x; fj,^) = 1 
and M(/i^) = m{fj,) where m(^) becomes the running 
mass at large momenta. 

Below we extract the nonperturbative functions M{p^) 
and Z{p'^) = Z2{pL]a)Z[ii;p'^) directly from a lattice cal- 
culation as it was discussed in great detail in, e.g., Ref. 
[35l |. We perform a cylinder-cut (ij] on all our data and 
average over the discrete rotational and parity symme- 
tries of Sl{p', a) to increase the statistics. 

D. The lattice quark propagator at tree- level 

For the sake of easier notation we will suppress the 
a dependence of the lattice quark propagator and write 
Zl (p) and Ml (p) as functions of p rather than p^ in the 
following discussion. 

The lattice quark propagator at tree-level S^f' (p) dif- 
fers from the continuum case, Eq. ([8]), due to discretiza- 
tion artifacts, 

sf\p)^(^af: + aMf^(J>)y\ (12) 

The dressing function A^''(p) is by construction equal 
to one at tree-level (at least without tree-level improve- 
ment) and thus the function B^^\p) equals at tree- level 
the mass function m'^\p). 

We extract the CI lattice momentum ak{p) from the 
tree-level propagator on the lattice and depict it in Fig.[T] 
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FIG. 1. CI lattice momentum ak(p) extracted from the tree- 
level propagator (crosses) compared to the analytical expres- 
sion (full line) given in Appendix [Bl 

The result is consistent with the analytically derived ex- 
pression for the Dqi momenta given in Appendix |B] 

The tree-level mass function aM^\p) which in the 
continuum equals the bare mass m, is shown in Fig. [2] 
(red crosses), again together with the corresponding an- 
alytical expression. We find that aM^^^ip) has a zero- 
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FIG. 2. The lattice quark propagator mass function at tree- 
level (red crosses and full line) and in the unimproved full 
interacting case (blue triangles) without tree-level correction. 
The tree-level results comprise a lattice extraction from the 
tree-level Dci (red crosses) and a plot of the analytical ex- 
pression of the mass function (red line) given in Appendix 

El 

crossing and aMf\Q) w —0.333. The latter value is 
trivially equal to the sum of all coefficients of Eq. ([7]) 
that come with a unit matrix in Dirac space 

^ Si +7710 (13) 

i 

whereby the bare mass parameter is mo — —0.077 and 
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the non-zero Si are listed in Appendix IB] 



must be included into the action: 



E. The interacting propagator 

We expect the interacting propagator to have a similar 
form to the continuum case Eq. (jlOp . hence we write 



(14) 



The functions aM^lp) and Zl{p) extracted from the lat- 
tice Monte Carlo simulation are shown in Fig. [5] and 
Fig. [3] (blue triangles), respectively. The shape of aA/i(p) 
is similar to aMf\p) and also Zl(j>) strongly deviates 
from the expected monotonically growing behavior, thus 
is clearly altered by discretization errors. 
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FIG. 3. The wave-function renormalization function Zl{p) of 
the CI quark propagator: unimproved and without tree-level 
correction (blue triangles) and with tree-level improvement 
and tree- level correction (red circles). The renormalization 
point is set at fj, = 2 GeV. 



To get a handle on the lattice artifacts, i.e., to retain 
the shapes of the wave- function renormalization function 
and the mass function familiar from earlier lattice works 
as well as from Dyson-Schwinger equation studies (36l |. 
we discuss improvement and tree-level correction in the 
forthcoming subsections. 



F. Improvement 

The Symanzik improvement program j37| offers a sys- 
tematic way to reduce the errors of the fermionic action 
to 0{a^). All terms that have the correct dimensional- 
ity and the symmetries of the QCD fermionic Lagrangian 



Li{x 
2.2 (a; 

L4{x 



TOtr [Ff^^Fi_^^] , 
m'^'ipip. 



(15) 

(16) 
(17) 

(18) 

(19) 



The terms L3 and L5 can be accounted for by a redefi- 
nition of the bare parameters m and g. L2 and L4 are 
only needed for off-shell quantities like hadronic matrix 
elements or the quark propagator "SS*]. Thus for on- 
shell quantities it is sufficient to take the clover term [39| 
(which corresponds to Li) into account. 

Note that whereas exact GW fermions are automat- 
ically 0{a) improved, the CI operator fulfills the GW 
equation only approximately and thus the clover term is 
included in the CI action. 

Since the quark propagator is an off-shell quantity we 
would like to include the terms L2 and L4 as well. In [i^ 
it is shown that at tree-level L2 and L4 can be eliminated 
by a transformation of the fermion fields according to 



a 

1 H — m 
4 



1 



V' -> ( 1 + ^m) V (1 + 



1^. 



(20) 
(21) 



Improvement beyond tree-level requires tuning of the 
coefficients of the fermion field transformations 41] which 
we do not attempt. Hence we adopt the above fermion 
field transformations under which the quark propagator 
turns into 1, 2] 

Si{x, y) = ((1 + am)S[x, y- U) - '^5{x, y)) (22) 



where the index / denotes improvement. In Eq. \22\ . 
S{x,y; U) is obtained by inverting the Dqi operator on 
each configuration and the brackets denote Monte Carlo 
integration over the gauge fields U . 

All results that follow have been tree-level improved 
according to the above prescription. 



G. Tree-level correction 

In order to blank out the lattice artifacts which are 
already present at tree-level, we now focus on the deriva- 
tion of the interacting propagator from its tree-level form. 

For the renormalization function Zl{p) we adopt a 
multiplicative tree-level correction 



Zl{p) 



Zl{p) 

zTip) 



(23) 



As can be seen in Fig. [3] (red circles), this procedure 
together with the tree-level improvement from the pre- 
vious subsection fiattens Zl{p), hence reduces the domi- 
nant lattice artifacts. However, the fact that the function 
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is still not monotonically growing indicates that the im- 
provement coefheients are not sufficiently adjusted to re- 
move all 0{a) errors when simply picking their tree- level 
values. 

In order to apply a multiplicative tree-level correction 
to the mass function of the form 



aML{p) 



amML (p) 



(24) 



we have to carry out an additive mass renormalization 
of the tree-level function B^f^ (p) in order to avoid diver- 
gences, i.e., 



(0), 



where amadd is chosen such that i?^°''(0) — m, like in the 
continuum, thus 



(25) 



amadd = am - aBf\o) w 0.344 



(26) 



As a result, the multiplicative tree-level correction for the 
mass function is 



aML{p) 



amML{p)Af\p) 



(27) 



Alternatively, we may adopt an hybrid tree-level cor- 
rection which is based on the ideas developed in Ref . : 
if p < p' , then perform an additive tree- level correction 



aA'lLip) aMLip) 



aB^^ {p) + amadd 



(28) 



and for momenta larger than p' apply a multiplicative 
tree-level correction 



aMLip) 



amML{p)Af\p) 



(29) 



The momentum parameter p' should be adjusted thereby 
such that Ml{p) is continuous and smooth at p — p' 
which we found to be the case for p' = 1.5 GeV. 

Both possibilities of tree-level correction for the mass 
function Ml{p) are plotted in Fig. S] We observe that 
the pure multiplicative correction (blue crosses) results 
in an infrared enhanced function, enhancement occur- 
ring from 1.25 GeV on downwards and appearing to be 
rather steep. The hybrid scheme (red circles), on the 
other hand, exhibits a wider range of IR mass generation 
(from 2.5 GeV on downwards), gives a higher IR mass 
and yields flattening of the mass function in the deep IR. 
The hybrid scheme allows for an earlier mass generation 
due to the fact that the multiplicative correction therein 
(for p > p') does not require an additive mass renormal- 
ization since the zero-crossing of the tree-level function 
is handled by the additive tree- level correction (p < p'). 



When comparing these results to lattice quark prop- 
agator studies from a different fermionic action, for ex- 
ample to the (quenched) overlap quark propagator [Tsl - 
[T9| . we find better agreement for the hybrid scheme. It 
has to be stressed however that the parameter p' intro- 
duces a small arbitrariness to the procedure whereas the 
simpler pure multiplicative scheme provides a straight- 
forward comparison of the interacting mass function to 
its tree-level counterpart while still yielding qualitatively 
the correct physics. Consequently, for the next section we 
adopt the simpler multiplicative scheme for the analysis 
of the effects of Dirac low-mode removal on the quark 
propagator mass function in order to avoid possible sys- 
tematic errors related to the tuning of p' . 
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FIG. 4. The CI quark propagator mass function Ml(p) after 
improvement and application of a pure multiplicative (blue 
crosses) and an hybrid (red circles) tree-level correction pro- 
cedure. 



IV. RESTORATION OF CHIRAL SYMMETRY 

The lowest Dirac eigenmodes are known to play a 
crucial role for dynamical chiral symmetry breaking as 
stated by the Banks-Casher relation . The latter re- 
lates the chiral condensate to the density of the smallest 
nonzero Dirac eigenmodes. As a consequence, when re- 
moving the Dirac eigenmodes near the origin from the 
theory, the chiral condensate vanishes and chiral symme- 
try becomes "artificially restored" [25^ . 

The aim of the current work is to analyze the effects 
of artificial chiral restoration on the dressing functions 
of the quark propagator. Consider the Hermitian Dirac 
operator = 75I? which is normal and thus has real 
eigenvalues Hi . D can be written in terms of the spectral 
representation of Ds, 



i=l 



(30) 
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We split the quark propagator S = into a low-mode 
part (Im) and a reduced part (red), e.g., using the eigen- 
values and eigenvectors of D5, 

S = Y^ f^i^ \vi) {vi\ 15 + f^i^ \vi) {vi\ 75 (31) 



'lm(fc) 



:d(fc) 



(32) 



Hence we can obtain the reduced part of the propagator 
by subtracting the low-mode part from the full propaga- 
tor 



'S'rcd(*:) — S — 5'ini(fe). 



(33) 



We calculate the quark wave-function renormalization 
function (p) and the quark mass function Ml (p) from 
the reduced propagators of Eq. psp with varying reduc- 
tion levels k — 2 — 512. We tree-level improve the mod- 
ified propagators and apply the multiplicative tree-level 
correction scheme, cf. Sec. IIIII The dressing functions 
from reduced propagators are presented in Fig. [5] and 
Fig.il 



1.4 



1.2 
1 

0.8 
0.6 
0.4 
0.2 




iiiff I ■ 

m • t * ^ 



■ 



full 

red(2) 

red(4) 

red(8) 

red(16) 

red(32) 

red(64) 

red(96) 

red(128) 

red(256) 

red(384) 

red(512) 



0.5 



1.5 2 
P [GeV] 



2.5 



3.5 



FIG. 5. The quark wave-function renormalization function 
Zl{p) under Dirac eigenmode removal for different reduction 
levels k. The renormalization point is set at ^ = 4 GeV. 

Figure [S] reveals amplification of IR suppression of 
Zl{p) when subtracting Dirac low-modes whereas the 
range from medium to high momenta is not altered at 
all. The mass function Ml{p), Fig. [51 demonstrates a 
similar behavior: it gets suppressed in the IR when re- 
moving more and more eigenmodes until the dynamic 
generation of mass completely ceases at truncation stage 
k w 128. 

In Fig. [7] we compare the deep IR mass of the CI quark 
propagator from AIl{p'^^^), at the smallest available mo- 
mentum Pmin — 0.1345 GeV, as a function of the reduc- 
tion level to the mass splitting of the vector meson p 
and its chiral partner the axial vector current oi, taken 
from Ref. [25|. Note that the reduction level k can be 
translated to an energy scale which is given in the lower 
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FIG. 6. The quark mass function Ml{p) under Dirac eigen- 
mode removal for different reduction levels k. 
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FIG. 7. The infrared mass ML(Pmin) of the reduced CI quark 
propagator as a function of the reduction level compared to 
the mass splitting between the p and the ai from Ref. [25}. 
The upper abscissa shows the truncation level k and the lower 
abscissa gives the corresponding energy scale, the relation be- 
tween the two is obtained by integrating the histograms of 
the Ds eigenvalues. 



abscissa of the figure and was derived in [2q | by integrat- 
ing the histograms of the eigenvalues. 

The mass splitting between the p and the ai rapidly 
drops down and reaches a plateau after subtracting about 
16 eigenmodes; it does not go down to zero which can 
most likely be attributed to the small explicit chiral sym- 
metry breaking by the nonvanishing quark mass. In 
contrast, the dynamically generated mass of the quark 
propagator, Mi(p^;jj), decreases slower and reaches its 
plateau only after subtracting more than 128 Dirac eigen- 
modes. 
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V. CONCLUSIONS 

The wave- function renormalization function Z(jP) and 
the mass function M(p'^) from the CI quark propagator 
have been analyzed on configurations with two hght de- 
generate CI quark flavors. It has been demonstrated that 
the combination of tree-level improvement and a multi- 
plicative or hybrid tree-level correction scheme drasti- 
cally reduce the dominant lattice artifacts. 

Removing the lowest Dirac eigenmodes out of the 
quark propagator strongly suppresses the wave-function 
renormalization function in the IR and completely dis- 
solves the dynamically generated mass displayed by 
MijP'). Under Dirac low-mode removal, the mass func- 
tion is found to reveal a smoother transition towards chi- 
ral restoration than the splitting of the vector and axial 
vector currents. 
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Appendix A: Gauge fixing on the GPU 

In the current Appendix we discuss how the process of 
lattice gauge fixing with the overrelaxation algorithm can 
be accelerated by using NVIDIA® 's CUDA (Compute 
Unified Device Architecture) programming environment 
for CPUs. We compare the performance of the overrelax- 
ation algorithm on one GPU (NVIDIA GeForce® GTX 
580) with conventional calculations on the CPU and ap- 
ply techniques to relax the bandwidth restrictions of the 
GPU. 

In the recent years, many groups in the lattice QCD 
community have taken advantage of the cost effective op- 
portunity to adopt CPUs for high-performance lattice 
QCD computations. Whereas the pioneering work of 
GPU calculations in lattice QCD was reported in [i^ . 
some more recent examples are given by |43l - l5Cl| . 



1. Gauge fixing via (over)relaxation 

The underlying idea of the relaxation algorithm [5l| is 
a local optimization of the gauge functional Fg\U], i.e.. 



for all X the maximum of 5He tr [g{x)K{x)] is wanted, 
where 

K{x) EE J2 {Uf.{x)g{x + A)t + U^ix - fi^gix - A)t) . 

(Al) 

The solution of the aforementioned subtask is given, in 
the case of the gauge group SU(2), by 



K{x)'^ 



^det [K{xY 



(A2) 



and for SU(3) one iteratively operates in the three sub- 
groups of SU(2). From equations (|A1I) and (|A2p it is ev- 
ident that one can optimize all sites in each of the black 
and white checker sub-lattices simultaneously. 

In order to reduce the critical slowing down of the re- 
laxation algorithm on large lattices, the authors of fs^ 
suggested to apply an overrelaxation algorithm which 
replaces the gauge transformation g{x) by g'^{x) in each 
step of the iteration. This method has widely been stud- 
ied and the value of w was found to be well adapted at 
around 1.7, see 123 and references therein. 



2. Lattice QCD on the GPU 

Since CUDA supports only lattices up to three dimen- 
sions natively, one single index that runs over all four 
dimensions of the space-time lattice is used. We as- 
sign each lattice site to a separate thread and start 32 
threads per multiprocessor. Better occupancy would be 
achieved with more threads per multiprocessor but we 
are restricted by the 48 KB of LI cache. 

A function which is called from the host system and 
which performs calculations on the GPU is called a ker- 
nel. We implemented two kernels, one which checks the 
current value of the gauge fixing functional, Eq. ^ , and 
the gauge precision, Eq. (jS]), after every 100th iteration 
step and a second which does the actual work, i.e., which 
performs an overrelaxation step. The latter is invoked 
for black and white lattice sites consecutively. 



3. Optimization 

The GPU can read data from global device memory in 
an efficient way only if the data is accurately coalesced; 
that means the largest memory throughput is achieved 
when consecutive threads read from consecutive memory 
addresses. In order to do so we rearrange the gauge field 
into two blocks, one for even and one for odd lattice sites. 
Moreover, for the same sake of memory coalescing, we 
choose the site index running faster than color and Dirac 
indices. 

Applying the overrelaxation algorithm to one lattice 
site requires 2253 floating point operations and we have 
to read and write eight SU(3) matrices for every site; thus 
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the required data transfer in single precision is 1152 bytes 
per site. Comparing the ratio data transfer per floating 
point operation, 1152/2253 w 0.51, with the theoreti- 
cal peak performance of the GTX 580, 192/1581 w 0.12, 
we clearly see that we are solely constrained by memory 
bandwidth and not by the maximum number of arith- 
metic instructions. 

In order to reduce memory traffic we make use of the 
unitarity of SU(3) matrices and reconstruct the third line 
of each matrix on the fly when needed instead of storing 
it (43I . A minimal 8 parameter reconstruction turned out 
to be numerically not stable enough for our purpose since 
we not only read but also write the modified gauge fields 
in each step of the iteration. For more details see [H, Q 
and references therein. 



4. Performance 

We generated quenched SU(3) configurations with the 
heat-bath algorithm with 2A^ spatial lattice sites and a 
varying temporal extent from 4 to 128. On these configu- 
rations we compare the performance of our GPU kernels 
on the GTX 580 in single and double precision to the per- 
formance of the equivalent code (with a data alignment 
more appropriate for the CPU) on one core of the InteP 
Core^" 17-950 ("Nehalem Bloomfield") processor run at 
3.06 GHz, whereby the CPU code is optimized through 
SSE4 instructions generated by the compiler. 




32 64 96 128 32 64 96 128 

lalticc size [x24"^] lallice size f>L24^] 



FIG. 8. Performance of the overrelaxation algorithm for fixing 
the gauge on the GPU (NVIDIA GeForce GTX 580) in single 
precision (SP) and double precision (DP) with and without 
the 12 parameter reconstruction for SU(3) matrices described 
in the text, compared to the performance on one core of the 
Intel Core 17-950 processor (CPU) in single precision. On the 
l.h.s. shown in terms of Gflops and on the r.h.s. in terms of 
time (seconds) to solution. 

The results of the performance test are given in 
Fig. [51 in the l.h.s. plot we show the performance of the 
algorithm using a 12 parameter reconstruction and a full 
18 number representation in single and double precision 
together with the performance of the same code run on 
the CPU in single precision. We achieve a maximum 
performance of 135 Gflops (independent of the lattice 



Using the Intel compiler (12.0.0) with the compiler flag SSE4.2. 



size) for the case of the 12 parameter reconstruction in 
single precision. On the r.h.s. of Fig. [5] we present a 
summary of the time needed to fix the gauge for the 
various settings up to the test accuracy of 6* < 10~^. 
Overall, we find that for the task of gauge fixing with 
the overrelaxation algorithm the computational power of 
one GPU is equivalent to approximately 40 CPU cores 
(under the assumption of ideal scaling). 



Appendix B: Analytical expressions for the 
tree-level CI Dirac operator 

At tree-level, the tensor, axialvector and pseudoscalar 
terms of Eq. ^ vanish identically and only scalar and 
vector terms remain [2^ . When transformed to momen- 
tum space one obtains the following analytical expres- 
sions for the latter two: the scalar part, i.e., the tree-level 
mass function which is plotted in Fig. [5] is given by 

a4°'(p) -S1+48S13 

+ (2s2 + 12s8)(cos(po) + cos(pi) + cos(p2) + cos(p3)) 
+ (8s3 + 64sii)(cos(po) cos(pi) -I- cos(po) cos(p2) 

-I- COs(po) C0S(P3) -I- COs(pi) COs(p2) + COs(pi) C0s(p3) 
-|-C0S(P2)C0S(P3)) 

-I- 48s5(cos(po) cos(pi) cos(p2) + cos(po) cos(pi) cos(p3) 

-I- COs(po) C0S(P2) C0s(p3) -|- COs(pi) C0s(p2) COs(p3)) 

-I- 8s6(cos(po) cos(2pi) -I- cos(po) cos(2p2) 
-I- cos(po) cos(2p3) -I- cos(pi) cos(2p2) 
-I- cos(pi) cos(2p3) -I- cos(p2) cos(2p3) 
+ cos(2po) cos(pi) + cos(2po) cos(p2) 
+ cos(2po) cos(p3) + cos(2pi) cos(p2) 
-I- cos(2pi) cos(p3) + cos(2p2) cos(p3)) 
-I- 384sio cos(po) cos(pi) cos(p2) cos(p3) 
+ mo, 

where the relevant coefficients are listed in Table HI In 
the same manner, the analytical expressions of the lattice 
momenta A;^(p^) from Fig. [1] read 

ko = 2vi sin(po) + 8^2 sin(po)(cos(pi) + cos(p2) + cos(p3)) 
-I- (32u4 + I6V5) sin(po)(cos(pi) cos(p2) + cos(pi) cos(p3) 

-hC0s(p2)c0s(p3)), 

ki = 2vi sin(pi) + 8^2 sin(pi)(cos(po) + cos(p2) + cos(p3)) 
-f (32^4 + 16i;5) sin(pi)(cos(po) cos(p2) + cos(po) cos(p3) 

-f C0S(P2)C0S(P3)), 

k2 = 2vi sin(p2) + 8w2 sin(p2)(cos(po) + cos(pi) -I- cos(p3)) 
+ {32v4 + I6v<i) sin(p2)(cos(po) cos(pi) -I- cos(po) cos(p3) 

+ C0s(pi)c0s(p3)), 

ks = 2vi sin(p3) -I- 8^2 sin(p3)(cos(po) + cos(pi) + cos(p2)) 
-I- (32^4 + I6V5) sin(p3)(cos(po) cos(pi) -I- cos(po) cos(p2) 

-HC0s(pi)c0s(p2))- 
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The wave-function renormalization function is equal to TABLE I. The relevant Dci coefficients. For a complete de- 
one at tree- level by construction. scription see 



Sl 


0.1481599252 x 10^ 


82 


-0.5218251439 x 10"^ 


S3 


-0.1473643847 x 10"^ 


S5 


-0.2186103421 x 10~^ 


S6 


0.2133989696 x 10^^ 


S8 


-0.3997001821 x 10"^ 


SlO 


-0.4951673735 x 10~^ 


Sll 


-0.9836500799 x 10"^ 


Sl3 


0.7529838581 x 10-^ 


Vl 


0.1972229309 x lO" 


V2 


0.8252157565 x 10"^ 


V4 


0.5113056314 x 10"^ 


V5 


0.1736609425 x 10"^ 


mo 


-0.077 
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